In the last two lectures, we discuss a general model for learning: neural networks.
Google brain:
Elements of Statistical Learning (ESL) Chapter 11: https://web.stanford.edu/~hastie/ElemStatLearn/.
Aka, Single layer perceptron, single hidden layer back-propagation network.
Sum of nonlinear functions of linear combinations of the inputs, typically represented by a network diagram.
Output layer: \(Y=(Y_1, \ldots, Y_K)\) is the \(K\)-dimensional output. E.g., for a univariate response, \(K=1\); for \(K\)-class classification, the \(k\)-th unit models the probability of class \(k\).
Input layer: \(X=(X_1, \ldots, X_p)\) are \(p\)-dimensional input features.
Hidden layer: \(Z=(Z_1, \ldots, Z_M)\) are derived features created from linear combinations of inputs \(X\).
\(T=(T_1, \ldots, T_K)\) are the output features that are directly associated with the outputs \(Y\) through output functions \(g_k(\cdot)\).
\(g_k(T) = T_k\) for regression. \(g_k(T) = e^{T_k} / \sum_{\ell=1}^K e^{T_\ell}\) (the softmax function) for \(K\)-class classification.
Number of weights (parameters) is \(M(p+1) + K(M+1)\).
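The forward pass and the weight count can be sketched in plain Python. This is a minimal sketch, not a definitive implementation: it assumes the ESL form \(Z_m = \sigma(\alpha_{m0} + \alpha_m^T X)\) and \(T_k = \beta_{k0} + \beta_k^T Z\), stores each bias as the first entry of its weight row, and uses illustrative names (`init_params`, `forward`, `n_weights`) and toy dimensions that do not come from the lecture.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(t):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(t)
    exps = [math.exp(tk - m) for tk in t]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(p, M, K, scale=0.1, seed=0):
    # alpha[m] = [alpha_m0, alpha_m1, ..., alpha_mp]; beta[k] = [beta_k0, ..., beta_kM]
    rng = random.Random(seed)
    alpha = [[rng.uniform(-scale, scale) for _ in range(p + 1)] for _ in range(M)]
    beta = [[rng.uniform(-scale, scale) for _ in range(M + 1)] for _ in range(K)]
    return alpha, beta

def forward(x, alpha, beta, classify=True):
    # Hidden layer: Z_m = sigma(alpha_m0 + alpha_m^T x)
    z = [sigmoid(a[0] + sum(w * xj for w, xj in zip(a[1:], x))) for a in alpha]
    # Output features: T_k = beta_k0 + beta_k^T z
    t = [b[0] + sum(w * zm for w, zm in zip(b[1:], z)) for b in beta]
    # Output function g_k: softmax for classification, identity for regression
    return softmax(t) if classify else t

def n_weights(alpha, beta):
    return sum(len(row) for row in alpha) + sum(len(row) for row in beta)

p, M, K = 4, 3, 2  # toy sizes, chosen only for illustration
alpha, beta = init_params(p, M, K)
probs = forward([0.5, -1.0, 2.0, 0.0], alpha, beta)
print(probs, n_weights(alpha, beta))
```

With these toy sizes the count is \(M(p+1) + K(M+1) = 3 \cdot 5 + 2 \cdot 4 = 23\), and the softmax outputs sum to one.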
sigmoid function: \[ \sigma(v) = \frac{1}{1 + e^{-v}}. \]
\(\sigma(v) =\) a step function: used in early models of the human brain, where each unit represents a neuron and the connections represent synapses; a neuron fires when the total signal passed to that unit exceeds a certain threshold.
Rectifier: \(\sigma(v) = v_+ = \max(0, v)\). A unit employing the rectifier is called a rectified linear unit (ReLU). According to Wikipedia:
> The rectifier is, as of 2018, the most popular activation function for deep neural networks.
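
The three activation functions above can be written out directly. This is a minimal sketch; the function names and the threshold default of 0 for the step function are illustrative assumptions, and the sigmoid is split into two branches only to avoid floating-point overflow.

```python
import math

def sigmoid(v):
    """Logistic sigmoid 1 / (1 + e^{-v}), written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def sigmoid_prime(v):
    """Derivative sigma'(v) = sigma(v) * (1 - sigma(v)), needed for back-propagation."""
    s = sigmoid(v)
    return s * (1.0 - s)

def step(v, threshold=0.0):
    """Step activation: the unit fires (returns 1) once the signal exceeds the threshold."""
    return 1.0 if v > threshold else 0.0

def relu(v):
    """Rectifier v_+ = max(0, v)."""
    return max(0.0, v)

for v in (-2.0, 0.0, 2.0):
    print(v, sigmoid(v), step(v), relu(v))
```

Unlike the step function, the sigmoid is differentiable everywhere, which is what makes gradient-based fitting possible.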
Given training data \((x_1, y_1), \ldots, (x_n, y_n)\), the loss function \(L\) can be:
SSE: \[ L = \sum_{k=1}^K \sum_{i=1}^n [y_{ik} - f_k(x_i)]^2. \]
Cross-entropy (deviance) \[ L = - \sum_{k=1}^K \sum_{i=1}^n y_{ik} \log f_k(x_i). \]
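Both losses are easy to compute by hand on a tiny example. The sketch below assumes one-hot encoded responses \(y_{ik}\) and fitted values \(f_k(x_i)\) stored as nested lists; the toy numbers and the small `eps` guard against \(\log 0\) are illustrative additions, not part of the definitions.

```python
import math

def sse(Y, F):
    """Sum of squared errors over observations i and output units k."""
    return sum((y - f) ** 2 for yi, fi in zip(Y, F) for y, f in zip(yi, fi))

def cross_entropy(Y, F, eps=1e-12):
    """Cross-entropy (deviance): minus the sum of y_ik * log f_k(x_i)."""
    return -sum(y * math.log(max(f, eps)) for yi, fi in zip(Y, F) for y, f in zip(yi, fi))

# Two observations, three classes (toy values)
Y = [[1, 0, 0], [0, 1, 0]]              # one-hot responses
F = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # fitted class probabilities

print(sse(Y, F))            # 0.09 + 0.04 + 0.01 + 0.01 + 0.04 + 0.01, about 0.20
print(cross_entropy(Y, F))  # -(log 0.7 + log 0.8), about 0.58
```

With one-hot responses, only the fitted probability of the true class contributes to the cross-entropy.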
Model fitting: back propagation (gradient descent)
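Back propagation equations \[ s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^K \beta_{km} \delta_{ki}, \] where \(\delta_{ki}\) and \(s_{mi}\) are the errors of the current model at the output and hidden layer units, respectively.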
Two-pass updates: initialization \(\to \hat f_k(x_i) \to \delta_{ki} \to s_{mi} \to \hat \beta_{km} \text{ and } \hat \alpha_{ml}\).
\(\gamma_r\) is the learning rate at iteration \(r\) of the gradient descent updates.
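The two-pass scheme can be sketched in plain Python for the squared-error case with identity output functions. This is a minimal sketch, not a definitive implementation. It assumes, following ESL Section 11.4, that \(\delta_{ki} = -2(y_{ik} - f_k(x_i))\) (since \(g_k' = 1\)), that the gradients are \(\partial R_i / \partial \beta_{km} = \delta_{ki} z_{mi}\) and \(\partial R_i / \partial \alpha_{ml} = s_{mi} x_{il}\), and that a constant learning rate `gamma` stands in for \(\gamma_r\). The toy data, the helper names, and the use of batch updates are illustrative assumptions.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha, beta):
    """Forward pass. x carries a leading 1 so the bias is part of alpha_m^T x."""
    a = [sum(w * xl for w, xl in zip(am, x)) for am in alpha]  # alpha_m^T x_i
    z = [1.0] + [sigmoid(v) for v in a]                        # z_0 = 1 is the output bias unit
    f = [sum(w * zm for w, zm in zip(bk, z)) for bk in beta]   # identity output function
    return a, z, f

def loss(X, Y, alpha, beta):
    return sum((yk - fk) ** 2
               for x, y in zip(X, Y)
               for yk, fk in zip(y, forward(x, alpha, beta)[2]))

def gradients(X, Y, alpha, beta):
    g_alpha = [[0.0] * len(am) for am in alpha]
    g_beta = [[0.0] * len(bk) for bk in beta]
    for x, y in zip(X, Y):
        a, z, f = forward(x, alpha, beta)                      # forward pass: fitted values
        delta = [-2.0 * (yk - fk) for yk, fk in zip(y, f)]     # output-layer errors delta_ki
        for m, am in enumerate(a):                             # backward pass: hidden errors s_mi
            sp = sigmoid(am) * (1.0 - sigmoid(am))             # sigma'(alpha_m^T x_i)
            s_m = sp * sum(beta[k][m + 1] * delta[k] for k in range(len(beta)))
            for l, xl in enumerate(x):
                g_alpha[m][l] += s_m * xl
        for k, dk in enumerate(delta):
            for m, zm in enumerate(z):
                g_beta[k][m] += dk * zm
    return g_alpha, g_beta

def descend(W, G, gamma):
    return [[w - gamma * g for w, g in zip(wr, gr)] for wr, gr in zip(W, G)]

# Toy regression problem: one input, one output, M = 3 hidden units
rng = random.Random(0)
X = [[1.0, x] for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
Y = [[math.sin(3 * x[1])] for x in X]
M, gamma = 3, 0.01
alpha = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(M)]
beta = [[rng.uniform(-0.5, 0.5) for _ in range(M + 1)]]

loss_before = loss(X, Y, alpha, beta)
for r in range(200):
    g_alpha, g_beta = gradients(X, Y, alpha, beta)
    alpha, beta = descend(alpha, g_alpha, gamma), descend(beta, g_beta, gamma)
loss_after = loss(X, Y, alpha, beta)
print(loss_before, loss_after)
```

Comparing an analytic gradient with a finite-difference approximation of the loss is a cheap way to check such code before trusting the fit.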
Advantages: simple and local nature; each hidden unit passes and receives information only to and from units with which it shares a connection; can be implemented efficiently on a parallel architecture computer.
Alternative fitting methods: conjugate gradients, variable metric methods.
Aka multi-layer perceptron (MLP).
Starting values: usually starting values for weights are chosen to be random values near zero; hence the model starts out nearly linear, and becomes nonlinear as the weights increase.
Overfitting: early stopping; weight decay by \(L_2\) penalty
\[
\frac{\lambda}{2} (\sum_{k, m} \beta_{km}^2 + \sum_{m, l} \alpha_{ml}^2).
\] \(\lambda\) is the weight decay parameter.
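In gradient descent, the weight decay penalty simply adds \(\lambda w\) to the gradient of each penalized weight \(w\), which shrinks the weights toward zero at every step. A minimal sketch, assuming weights stored as nested lists and penalizing every weight passed in (whether biases are included is a modeling choice the formula above does not settle); the function names are illustrative.

```python
def weight_decay_penalty(alpha, beta, lam):
    """(lambda / 2) * (sum of beta_km^2 + sum of alpha_ml^2)."""
    total = sum(w * w for row in beta for w in row) + sum(w * w for row in alpha for w in row)
    return 0.5 * lam * total

def add_decay_gradient(grad, weights, lam):
    """Gradient of loss + penalty: add lambda * w to each entry of the loss gradient."""
    return [[g + lam * w for g, w in zip(gr, wr)] for gr, wr in zip(grad, weights)]

alpha = [[1.0, 2.0]]  # toy weights
beta = [[3.0]]
print(weight_decay_penalty(alpha, beta, lam=0.1))                # 0.05 * (1 + 4 + 9)
print(add_decay_gradient([[0.0, 1.0]], [[2.0, -2.0]], lam=0.5))
```

Larger values of \(\lambda\) shrink the weights more strongly; in practice \(\lambda\) is usually chosen by cross-validation.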
Scaling of inputs: mean 0 and standard deviation 1.
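Standardizing the inputs puts all features on a comparable footing for the weight decay penalty and for the near-zero starting values. A minimal sketch, assuming a data matrix stored as a list of rows; the function names are illustrative, and constant columns are mapped to 0 to avoid division by zero.

```python
import math

def standardize(X):
    """Center each column to mean 0 and scale to standard deviation 1."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) for j in range(p)]
    Xs = [apply_standardization(row, means, sds) for row in X]
    return Xs, means, sds

def apply_standardization(row, means, sds):
    """Reuse the training means and standard deviations on a new observation."""
    return [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]

X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]  # toy data
Xs, means, sds = standardize(X_train)
print(Xs)
print(apply_standardization([2.0, 30.0], means, sds))
```

Test data should be transformed with the means and standard deviations computed on the training set, as `apply_standardization` does here.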
How many hidden units and how many hidden layers: guided by domain knowledge and experimentation.
Multiple minima: try with different starting values.
The neural network model is a projection pursuit type additive model: \[ f(X) = \beta_0 + \sum_{m=1}^M \beta_m \sigma(\alpha_{m0} + \alpha_m^T X). \]
Neural networks are not a fully automatic tool, as they are sometimes advertised; as with all statistical models, subject matter knowledge can and should be used to improve their performance.
Sources: https://colah.github.io/posts/2014-07-Conv-Nets-Modular/
Fully connected networks don’t scale well with dimension of input images. E.g. \(96 \times 96\) images have about \(10^4\) input units, and assuming you want to learn 100 features, you have about \(10^6\) parameters to learn.
In locally connected networks, each hidden unit only connects to a small contiguous region of pixels in the input, e.g., a patch of image or a time span of the input audio.
Consider \(96 \times 96\) images. For each hidden unit, first learn an \(8 \times 8\) feature detector from randomly sampled \(8 \times 8\) patches of the larger image. Then apply the learned detector to all \(8 \times 8\) regions of the \(96 \times 96\) image to obtain \((96 - 8 + 1) \times (96 - 8 + 1) = 89 \times 89\) convolved features for that hidden unit.
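The sizes quoted above can be checked with a small pure-Python convolution. This is a sketch under stated assumptions: the image and the averaging detector are made-up placeholders rather than learned features, the bias and nonlinearity are omitted, and the operation is the unflipped sliding-window sum commonly called convolution in the CNN literature.

```python
def convolve_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside the image."""
    H, W = len(image), len(image[0])
    h, w = len(kernel), len(kernel[0])
    out = []
    for r in range(H - h + 1):
        row = []
        for c in range(W - w + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(h) for j in range(w)))
        out.append(row)
    return out

# Placeholder 96 x 96 image and 8 x 8 detector (not learned, for illustration only)
image = [[float((r * 96 + c) % 7) for c in range(96)] for r in range(96)]
kernel = [[1.0 / 64] * 8 for _ in range(8)]

features = convolve_valid(image, kernel)
print(len(features), len(features[0]))  # 89 89

# Weight counts, ignoring biases
fully_connected_weights = 96 * 96 * 100  # 100 hidden units, each seeing all 9216 pixels
shared_detector_weights = 8 * 8          # one detector reused at every location
print(fully_connected_weights, shared_detector_weights)
```

The fully connected layer needs 921,600 weights (about \(10^6\)), while a single shared \(8 \times 8\) detector needs only 64 regardless of image size.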
Input: 256 pixel values from \(16 \times 16\) grayscale images. Output: one of the digits 0, 1, …, 9 (10-class classification).
A modest experiment subset: 320 training digits and 160 testing digits.
| network | links | weights | accuracy |
|---|---|---|---|
| net 1 | 2570 | 2570 | 80.0% |
| net 2 | 3214 | 3214 | 87.0% |
| net 3 | 1226 | 1226 | 88.5% |
| net 4 | 2266 | 1132 | 94.0% |
| net 5 | 5194 | 1060 | 98.4% |
ImageNet dataset.
Novel techniques: GPU, ReLU, DropOut.
Sources: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ http://karpathy.github.io/2015/05/21/rnn-effectiveness/
MLP and CNN accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes).
Recurrent neural networks (RNN) allow us to operate over sequences of vectors: sequences in the input, the output, or in the most general case both.
Applications:
NLP/Speech: transcribe speech to text, machine translation, generate handwritten text, …
Computer vision: image captioning, video captioning, …
RNNs accept an input vector \(x\) and give you an output vector \(y\). However, crucially this output vector’s contents are influenced not only by the input you just fed in, but also by the entire history of inputs you’ve fed in in the past.
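The dependence on history comes from a hidden state that is carried from one step to the next. Below is a minimal vanilla RNN sketch in plain Python, in the spirit of the Karpathy post cited above: the hidden state is updated as \(h \leftarrow \tanh(W_{hh} h + W_{xh} x)\) and the output is \(y = W_{hy} h\). The class name, the random untrained weights, and the toy dimensions are illustrative assumptions.

```python
import math
import random

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

class VanillaRNN:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        self.W_xh = rand(n_hidden, n_in)      # input to hidden
        self.W_hh = rand(n_hidden, n_hidden)  # hidden to hidden (the recurrence)
        self.W_hy = rand(n_out, n_hidden)     # hidden to output
        self.h = [0.0] * n_hidden             # hidden state, initially zero

    def step(self, x):
        # New hidden state depends on the previous state and the current input.
        pre = [a + b for a, b in zip(matvec(self.W_hh, self.h), matvec(self.W_xh, x))]
        self.h = [math.tanh(v) for v in pre]
        return matvec(self.W_hy, self.h)

# Two networks with identical weights see the same final input but different histories.
rnn_a = VanillaRNN(n_in=2, n_hidden=4, n_out=3)
rnn_b = VanillaRNN(n_in=2, n_hidden=4, n_out=3)

rnn_a.step([1.0, 0.0])        # rnn_a has seen an earlier input
y_a = rnn_a.step([0.0, 1.0])
y_b = rnn_b.step([0.0, 1.0])  # rnn_b sees only the final input
print(y_a)
print(y_b)
```

The two outputs differ even though the last input is identical, because `rnn_a` carries information about its earlier input in `h`.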